SOME RESULTS OF STATISTICAL STUDIES OF A
DESCRIPTOR LANGUAGE

V. K. Vakhabov

The application of statistical methods to the study of artificial
information languages can be useful in selecting rational methods of
organizing storage for machine retrieval. Thus in [1] some statistical
regularities of descriptor retrieval are examined theoretically, and
an example of their use in analyzing the inverted method of search is
given. Two hypotheses were posed as the basis for these regularities:
each descriptor of the information-retrieval language has one and the
same probability of appearing in the search patterns of documents and
in those of requests; the probability of use of descriptors in the
search patterns of documents and requests obeys the laws of Zipf and
Mandelbrot.

An experimental check of these assumptions and of the regularities
deduced from them is described below. The influence on the frequency
of use of descriptors of the number of keywords in the equivalence
class, the number of hierarchical references to the descriptor, and
the frequency of occurrence of the keywords was also investigated.

The Zipf law of hyperbolic distribution, which governs word frequency
in natural-language texts, is used extensively in statistical
linguistics. It proves suitable for describing the most diverse
phenomena; for example, it holds steadily in natural languages for the
distribution of both word forms and lexemes (a lexeme being the
totality of the word forms of one inflected word) [2].

It has been shown in [3] and [4] that the Zipf law also extends to
artificial descriptor information languages, with respect to the
distribution of descriptors in the search patterns of documents.


Works [5] and [6] show that the Zipf law has a common nature with the
Lotka law for the productivity of scientists and with the Bradford law
of the scattering of information in journals, and that it holds in a
number of other cases that go beyond the bounds of statistical
linguistics. The universal nature of the Zipf law suggests that it can
be deduced from the properties of some common model underlying all
these phenomena.

One such model was constructed by Mandelbrot [7, 8] on the basis of
concepts from communication theory. Words are regarded as information
transmitted over a communication channel. Words are formed by an
ergodic source of characters and are separated by spaces. With optimum
coding of words by characters, i.e., when the code is matched to the
channel so that its maximum information capacity is ensured, the
a priori probability of appearance of the i-th word is given by the
formula


$$p_i = \frac{K}{(i + B)^{\gamma}}, \qquad 1 \le i \le n, \tag{1}$$

where n is the volume of the dictionary, and K, B, and γ are
constants, with B ≪ n and 1 ≤ γ ≤ 1.2.

At B = 0 and γ = 1 the Mandelbrot formula (1) coincides with the
empirical formula of Zipf.

The Mandelbrot model explains the asymptotic nature of the Zipf law.
The value of γ in formula (1) is interpreted as a measure of the
imbalance between the frequencies of words and of characters. From
the standpoint of the Mandelbrot model it is possible to explain the
significant increase of γ (up to 1.6) in the speech of children and
mental patients, which apparently departs substantially from the
condition of optimum coding. The Mandelbrot law agrees better with the
empirical distributions of words in texts than the Zipf law does,
especially in the region of frequent words (i ≤ 15), where the Zipf
formula is generally unreliable.
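
The relation between the two laws can be illustrated numerically. The
following is a minimal sketch, assuming that K in formula (1) is the
normalizing constant that makes the probabilities sum to one; the
function name and the values B = 2.0 and γ = 1.1 are hypothetical and
serve only for illustration, while n = 502 is the dictionary size
reported later in the text.

```python
import math

def mandelbrot_probs(n, B=0.0, gamma=1.0):
    """Return (K, [p_1..p_n]) with p_i = K / (i + B)**gamma.

    Assumption: K is chosen so that the probabilities sum to 1.
    """
    weights = [1.0 / (i + B) ** gamma for i in range(1, n + 1)]
    K = 1.0 / sum(weights)
    return K, [K * w for w in weights]

n = 502  # dictionary size used later in the text

# B = 0 and gamma = 1 give the Zipf law p_i = K / i.
K_zipf, p_zipf = mandelbrot_probs(n)

# Approximation K = 1 / (ln n + C) with C = 0.5772, as used in the text.
K_approx = 1.0 / (math.log(n) + 0.5772)

# Hypothetical Mandelbrot parameters, for comparison only.
K_mand, p_mand = mandelbrot_probs(n, B=2.0, gamma=1.1)

print(f"Zipf K (exact sum)   = {K_zipf:.5f}")
print(f"Zipf K (ln n + C)    = {K_approx:.5f}")
print(f"p_1: Zipf {p_zipf[0]:.4f}, Mandelbrot {p_mand[0]:.4f}")
```

With B > 0 the head of the distribution is flattened relative to the
Zipf law, which is the region where the text reports the better fit of
the Mandelbrot formula.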

The hypothesis underlying the Mandelbrot model is disputed in [9],
where it is shown that the actual speech process is not determined by
the property of optimum coding. Furthermore, the Mandelbrot model is
not universal.

A more general approach to the derivation of the Zipf law is given in
[5], where a model of the generation of texts by a machine is
examined. The concept of the complexity of generating signs and texts
is introduced, with the probability of generating a text being a
continuous decreasing function of its complexity. An inverse
proportionality is postulated between the average complexity of a text
and its volume, together with an exponential dependence between the
number of signs of the alphabet and complexity. Under these conditions
the frequency of appearance of signs in a text converges in
probability to an expression analogous to the Zipf law. The advantage
of the Schreider model is that it is more general than the Mandelbrot
model.

Indeed, the concept of complexity is apparently natural for language.
However, in the study of, for example, a real descriptor language, the
complexity that determines the frequency of descriptors cannot be
attributed to any single factor; it is defined as the generalized
resultant of the random interaction of a number of factors (the number
of keywords in the equivalence class of the descriptor, the presence
of hierarchical links between this descriptor and others, etc.).

If speech or text in any language is regarded as a complex random
process depending on many random and randomly interacting factors, it
is possible to assume that this process is described by the well-known
Bernoulli scheme. The Zipf law was derived from the Bernoulli scheme
of independent trials by Andryushchenko [10], who pointed out that the
probability that the N-th word of the text appears r times is
determined by the formula of the negative binomial distribution [11]:


(2) 

where p is the probability of appearance of the given word, and
q = 1 - p.

In the case of a finite dictionary the negative binomial distribution
reduces to the logarithmic distribution without a zero class [11], to
which the frequencies correspond:

(3)

In a large dictionary, when p is small and q → 1, a harmonic series
is obtained, which corresponds to the Zipf law. In [10] the systematic
deviations of empirical distributions from the laws of Zipf and
Mandelbrot are explained by the fact that in reality q ≠ 1.
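
The limiting behaviour described here can be checked numerically. The
sketch below does not reproduce formula (3), whose body is not legible
in the source; it assumes the standard form of the logarithmic
(log-series) distribution without a zero class,
P(r) = -q^r / (r ln(1 - q)), and shows that the ratio P(r) / P(1)
approaches the harmonic value 1/r as q → 1. The function names are
hypothetical.

```python
import math

def log_series_pmf(r, q):
    """Standard log-series pmf, r = 1, 2, ...; assumed form, not formula (3)."""
    return -(q ** r) / (r * math.log(1.0 - q))

def ratio_to_first(r, q):
    """P(r) / P(1); equals q**(r - 1) / r, which tends to 1/r as q -> 1."""
    return log_series_pmf(r, q) / log_series_pmf(1, q)

for q in (0.5, 0.9, 0.999):
    ratios = [ratio_to_first(r, q) for r in (2, 5, 10)]
    print(q, [f"{x:.4f}" for x in ratios])
```

For q noticeably below 1 the ratios fall off faster than 1/r, which is
one way to read the remark that deviations from the Zipf law arise
because in reality q ≠ 1.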

In [10] it is also shown that if the parameter γ in the Mandelbrot
formula (1) is evaluated from the relation

(4)

then γ proves to be an increasing function of i. This agrees fully
with the empirical data of Frumkina [2] on the increase of γ with the
increase in the volume of the dictionary.

The Bernoulli scheme at small p and q → 1 is apparently that
generalized model which gives rise to the law of hyperbolic
distribution. This scheme is apparently also applicable to deriving
the distribution law of descriptors in documents and requests.


EXPERIMENTAL CHECK OF THE LAW OF DISTRIBUTION
OF DESCRIPTORS IN DOCUMENTS

The distribution of descriptors in the search patterns of documents
was checked experimentally on a file of 1600 documents on computer
technology, indexed with a dictionary that unites 1181 keywords into
578 equivalence classes corresponding to the descriptors. Descriptors
in the dictionary are partially ordered by the hierarchical relation
"general - particular."

The following rule is observed during the indexing of documents: when
the search pattern of a document contains a descriptor that has a
reference to a higher descriptor, the latter is also included in the
search pattern. Therefore a higher descriptor always has a frequency
no lower than that of any of its subordinates.
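
The indexing rule can be sketched as follows. This is an illustration
only: the hierarchy fragment, the descriptor names, and the function
name are hypothetical, and each descriptor is assumed to refer to at
most one higher descriptor, which the text does not state.

```python
from collections import Counter

def expand_with_ancestors(descriptors, parent):
    """Return the descriptors together with all higher descriptors.

    `parent` maps a descriptor to the higher descriptor it refers to.
    """
    expanded = set()
    for d in descriptors:
        # Walk up the hierarchy; stop at the top or at a descriptor
        # whose ancestors have already been added.
        while d is not None and d not in expanded:
            expanded.add(d)
            d = parent.get(d)
    return expanded

# Hypothetical hierarchy fragment: subordinate -> higher descriptor.
parent = {
    "magnetic drum": "memory",
    "magnetic core": "memory",
    "memory": "computer",
}

# Hypothetical documents as indexed before the rule is applied.
docs = [{"magnetic drum"}, {"magnetic core"}, {"memory"}, {"computer"}]

patterns = [expand_with_ancestors(d, parent) for d in docs]
freq = Counter(d for pattern in patterns for d in pattern)
print(freq)
```

In the resulting counts every higher descriptor occurs at least as
often as each of its subordinates, as the rule implies.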

In the search patterns of the 1600 documents, 502 descriptors occurred
19,030 times, which corresponds to an average indexing depth of 12
descriptors per document.

The frequency distribution of occurrence of descriptors in the search
patterns of documents is plotted in Fig. 1, from which it is evident
that the curve corresponding to the Mandelbrot formula (1) describes
the real distribution somewhat better, especially in the range of high
and medium frequencies. The plot thus testifies to the applicability
of the laws of Zipf and Mandelbrot to the distribution of descriptors
in the search patterns of documents.

Fig. 1. The distribution of descriptors in the search patterns of
documents: 1 - the Zipf law; 2 - the Mandelbrot law (parameter values
illegible); 3 - the empirical distribution law of descriptors in
documents.

The reliability of the results obtained can be evaluated by using the
formula for the relative error of the measurement of frequency in the
analysis of a sample [2]:

(5)

where N is the sample size; p is the frequency to be determined from
the given sample; δ is the relative error of the measurement of
frequency; ρ is the confidence coefficient, from which Z_ρ is
determined [12].

At ρ = 0.95 it is possible to take Z_ρ = 2.

It is evident from formula (5) that the reliability of the
determination of frequency decreases for low-frequency descriptors.
Under the condition that the relative error does not exceed a given
value δ_0, it is possible to compute the minimum probability p_min of
a descriptor, for which

(6)

Considering that the probability is determined by the Zipf law
p_i = K/i, it is possible to find the number of the last descriptor
for which the condition δ ≤ δ_0 is still satisfied:

(7)


It is also possible to determine what fraction A of the sample is
occupied by the descriptors whose frequencies have been determined
with an error δ ≤ δ_0:

$$A = K \sum_{i=1}^{i_{\lim}} \frac{1}{i} \approx \frac{\ln i_{\lim} + C}{\ln n + C}, \tag{8}$$

where i_lim is the number of the last descriptor that satisfies the
condition δ ≤ δ_0, n is the volume of the dictionary, and C = 0.5772.

Taking δ_0 = 0.3 and taking into account that N = 19,030 and n = 502,
we obtain i_lim = 63 and A = 70%; i.e., 70% of the sample of 19,030
descriptor occurrences is made up of 63 descriptors whose frequencies
have been determined with a relative error no higher than 0.3. Such
accuracy is fully sufficient for drawing qualitative conclusions about
the distribution law.
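
The figures quoted above can be reproduced under one assumption. The
bodies of formulas (5) to (7) do not survive in the source; the sketch
below assumes that (5) has the common form δ = Z_ρ / sqrt(N p), so
that p_min = Z_ρ^2 / (N δ_0^2) and the limiting rank is K / p_min,
with K = 1 / (ln n + C) as elsewhere in the text. The function names
are hypothetical.

```python
import math

C = 0.5772  # Euler's constant, as given in the text

def zipf_constant(n):
    """K = 1 / (ln n + C) for a dictionary of n descriptors."""
    return 1.0 / (math.log(n) + C)

def limiting_rank(N, n, delta0, z=2.0):
    """Largest rank whose frequency is measured with error <= delta0.

    Assumes delta = z / sqrt(N * p), which is a guess at formula (5).
    """
    p_min = z ** 2 / (N * delta0 ** 2)
    return math.floor(zipf_constant(n) / p_min)

def covered_fraction(i_lim, n):
    """Approximate share of the sample taken by ranks 1..i_lim."""
    return (math.log(i_lim) + C) / (math.log(n) + C)

i_lim = limiting_rank(N=19030, n=502, delta0=0.3)
A = covered_fraction(i_lim, n=502)
print(i_lim, f"{A:.1%}")
```

Under this assumption the sketch gives a limiting rank of 63 and a
covered fraction of roughly 70%, in line with the values reported in
the text.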

ANALYSIS OF FACTORS WHICH INFLUENCE THE DISTRIBUTION
OF DESCRIPTORS IN DOCUMENTS

In the research on the distribution of descriptors in documents, a
study was made of how the frequency with which a descriptor enters the
search patterns of documents is influenced by the number of keywords
in the equivalence class of the given descriptor; by the presence of
references to the given descriptor from subordinate descriptors; and
by the mean frequency of appearance of the keywords of the given
descriptor in the search patterns of documents.

To evaluate the influence of each of these factors, the following were
determined for every descriptor, besides the frequency p with which it
enters the search patterns of documents: t, the number of keywords in
the equivalence class of the descriptor, and t_Σ, the number of
keywords in the equivalence classes of the given descriptor and of all
descriptors subordinate to it. In addition, for every descriptor the
following were computed: M = t_Σ - t, the number of keywords in the
descriptors subordinate to the given one, and f, the mean frequency of
appearance of a keyword of the given descriptor in the search patterns
of documents.


It should be noted that the value M characterizes the number of
hierarchical references to the given descriptor, while t_Σ
characterizes both of the first two factors simultaneously: the number
of keywords in the equivalence class of the descriptor and the number
of references to it.

Since all these factors are random in nature, it is advantageous to
use the methods of correlation analysis for a quantitative evaluation
of the influence of each of them on the frequency of occurrence of
descriptors. The degree of influence of each factor v on the frequency
can be evaluated from the value of the correlation coefficient
calculated by the product-moment method [13]:

$$r = \frac{\sum_{i=1}^{n} v_i p_i}{\sqrt{\sum_{i=1}^{n} v_i^2 \sum_{i=1}^{n} p_i^2}}, \tag{9}$$

where v_i and p_i are the deviations of the investigated factor v and
of the frequency p from their mathematical expectations, and n is the
number of descriptors in the dictionary.


However, it is more convenient to use the rank correlation
coefficient, which is practically independent of the measuring error
of the investigated factors. In this method [13] the characteristics
being compared are ranked in decreasing (or increasing) order of their
values, after which one operates no longer with the values but with
the ranks. The rank correlation coefficient is calculated according to
the formula

$$\rho = 1 - \frac{6 \sum_{i=1}^{n} d_i^2}{n^3 - n}, \tag{10}$$

where n is the value of the greatest rank (in this case, the volume of
the descriptor dictionary), and d_i is the difference between the
ranks of the frequency and of the investigated factor for the i-th
descriptor.
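
The ranking procedure can be sketched as below. The sketch assumes the
usual rank correlation formula ρ = 1 - 6 Σ d_i^2 / (n^3 - n) and
values without ties; the data and function names are hypothetical.

```python
def ranks_descending(values):
    """Rank 1 goes to the largest value; assumes all values are distinct."""
    order = sorted(range(len(values)), key=lambda k: values[k], reverse=True)
    ranks = [0] * len(values)
    for rank, k in enumerate(order, start=1):
        ranks[k] = rank
    return ranks

def rank_correlation(factor, frequency):
    """rho = 1 - 6 * sum(d_i**2) / (n**3 - n) on the two rank series."""
    n = len(factor)
    rank_f = ranks_descending(factor)
    rank_p = ranks_descending(frequency)
    d2 = sum((a - b) ** 2 for a, b in zip(rank_f, rank_p))
    return 1.0 - 6.0 * d2 / (n ** 3 - n)

# Hypothetical data for five descriptors:
keywords = [7, 3, 5, 1, 2]         # keywords in the equivalence class
frequency = [120, 40, 90, 15, 10]  # occurrences in search patterns

rho = rank_correlation(keywords, frequency)
print(rho)
```

Only the order of the values matters, so a measuring error that does
not change the order leaves the coefficient unchanged.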


If the measured factors have identical values at a number of points,
the rank correlation coefficient is computed by a generalized formula
that takes tied ranks into account [13]:

(11)

where the formula involves the number of repeated values of the factor
and t_j, the number of ranks united by the j-th value of the factor.


The product-moment correlation coefficient is connected with the rank
correlation coefficient by the expression

$$r = 2 \sin\left(\frac{\pi \rho}{6}\right). \tag{12}$$
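
Reading relation (12) as r = 2 sin(πρ/6), the conversion can be
checked against the pairs of coefficients that the text reports later
for the two groups of requests. The function name is hypothetical.

```python
import math

def product_moment_from_rank(rho):
    """r = 2 * sin(pi * rho / 6)."""
    return 2.0 * math.sin(math.pi * rho / 6.0)

# (rho, r) pairs reported in the text for experimental and real requests.
reported = [(0.862, 0.870), (0.69, 0.706)]

for rho, r_reported in reported:
    r = product_moment_from_rank(rho)
    print(f"rho = {rho}: computed r = {r:.3f}, reported r = {r_reported}")
```

The computed values agree with the reported ones to within a few
thousandths, which is compatible with rounding in the original
calculations.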

The results of the computations of the correlation coefficients
between the frequency and the enumerated factors are given in Table 1.

Table 1. The degree of influence of various factors on the frequency
of occurrence of descriptors in the search patterns of documents.

Columns: (1) name of factor; (2) rank correlation coefficient;
(3) product-moment correlation coefficient.

Values as they survive in the source (the factor labels and the
assignment of values to rows and columns are not recoverable):
+0.860, +0.676, +0.724, +0.738; [illegible], +0.475, +0.866, +0.856.

Figure 2 gives the regression curves constructed for these factors on
various intervals of descriptor frequency. A comparison of the
correlation coefficients and the regression curves permits the
following conclusions.

Fig. 2. The regression curves of the factors which determine the
frequency of descriptors in documents.

Each of the three factors (the mean frequency of appearance of
keywords in the search patterns of documents, the number of keywords
in the equivalence class of the descriptor, and the number of
references to the given descriptor from its subordinates)
substantially influences the frequency with which a descriptor enters
the search patterns of documents.

The number of hierarchical links of a descriptor, measured by the
number of keywords in the descriptors subordinate to it, influences
the frequency of occurrence somewhat less than the remaining factors.
This is explained primarily by the fact that hierarchical links are
encountered comparatively rarely in the descriptor dictionary in
question: of the 502 descriptors only 60 have references from
subordinate descriptors. However, an analysis of the regression curves
for t and M = t_Σ - t (see Fig. 2) shows that in the region of
frequent descriptors, the majority of which have references from
subordinates, the influence of the number of references on the
frequency of a descriptor proves to be more significant than that of
the number of keywords in the equivalence class of the descriptor. The
rare descriptors, which make up the greater part of the dictionary,
have almost no references, and the influence of the factor
M = t_Σ - t in this region is negligibly small.

CHECK OF THE LAW OF DISTRIBUTION OF DESCRIPTORS
IN REQUESTS

The distribution of descriptors in requests was also studied
experimentally. Two files of requests on computer technology were
examined: artificial requests, composed to check the quality of
operation of the IRS [Information Retrieval System], and real requests
of specialists, collected for input into the system for the selective
dissemination of information.

The first group contained 124 requests with a total of 546 descriptor
occurrences; the second group contained 113 requests with 355
descriptor occurrences.

A direct check of the distribution law on such small samples is
unreliable. The applicability of the Zipf law to the distribution of
descriptors in requests was therefore checked by an indirect method,
based on an experimental check of certain functionals that reflect the
statistical structure of a text and are calculated on the assumption
that the Zipf law holds. Such functionals [14] include n(N), the
number of different words in a text of N words, and n(m, N), the
number of words encountered m times in a text of N words.


In Table 2 the experimental values of these functionals are compared
with their mathematical expectations Mn(m, N) (at m > 2) and Mn(N),
calculated using the formulas of [14]

[formulas illegible in the source] (13)

which are valid when the Zipf law holds. In formulas (13) n is the
number of descriptors in the dictionary, and K = (ln n + C)^(-1) is
the constant in the formula of the Zipf law.
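
The two functionals can also be evaluated directly. The sketch below
does not reproduce the closed-form expressions (13) of [14]; it
computes the exact expectations under the assumption that the N
occurrences are independent draws from a Zipf distribution over n
descriptors (the Bernoulli scheme discussed earlier). The values
n = 502 and N = 546 come from the text; the function names are
hypothetical.

```python
import math

def zipf_probs(n):
    """p_i proportional to 1/i, normalized over i = 1..n."""
    weights = [1.0 / i for i in range(1, n + 1)]
    total = sum(weights)
    return [w / total for w in weights]

def expected_distinct(N, probs):
    """E[n(N)]: expected number of different descriptors in N draws."""
    return sum(1.0 - (1.0 - p) ** N for p in probs)

def expected_with_count(m, N, probs):
    """E[n(m, N)]: expected number of descriptors drawn exactly m times."""
    c = math.comb(N, m)
    return sum(c * p ** m * (1.0 - p) ** (N - m) for p in probs)

probs = zipf_probs(502)   # dictionary of 502 descriptors
N = 546                   # descriptor occurrences in the first group

distinct = expected_distinct(N, probs)
once = expected_with_count(1, N, probs)
twice = expected_with_count(2, N, probs)
print(f"E[n(N)] = {distinct:.1f}, E[n(1,N)] = {once:.1f}, E[n(2,N)] = {twice:.1f}")
```

Summing E[n(m, N)] over all m from 1 to N recovers E[n(N)], which gives
a simple consistency check on any tabulated values.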


Table 2. The experimental and calculated values of the functionals
connected with the distribution of descriptors in requests.

Columns: (1) name of functional; (2) requests of the 1st group;
(3) calculation; (4) experiment; (5) requests of the 2nd group. The
table entries do not survive in the source.

The satisfactory agreement of the experimental and calculated values
of the functionals testifies to the applicability of the Zipf law to
the distribution of descriptors in requests.

The frequency of appearance of each descriptor in requests apparently
depends in the same manner on the number of keywords in the
equivalence class of the given descriptor and on the mean frequency of
appearance of the keywords of the descriptor in requests.

However, the limited amount of experimental material on requests does
not permit a reliable quantitative evaluation of the influence of
these factors on the frequency of appearance of descriptors in
requests.

COMPARISON OF THE LAWS OF DISTRIBUTION OF
DESCRIPTORS IN DOCUMENTS AND REQUESTS

The hypothesis that the laws of distribution of descriptors in
documents and in requests agree is in essence equivalent to the
assumption that the frequency dictionaries of descriptors for
documents and for requests are statistically close.

The degree of statistical proximity of frequency dictionaries can also
be established by the method of rank correlation.

Computation of the correlation coefficients for both groups of
requests gives the following results:

experimental requests: ρ = +0.862, r = +0.870;

real requests: ρ = +0.69, r = +0.706.

The values of the correlation coefficients and the histograms
constructed for both series of requests (Fig. 3) testify to the
significant statistical proximity of the frequency dictionaries of
descriptors of documents and requests, i.e., to the truth of the
accepted hypothesis.


The frequency dictionary of real requests diverges from the dictionary
of documents more than the dictionary of experimental requests does.
An analysis establishes the reasons for these divergences. Thus the
descriptor "microelectronics" is one of the most frequent in the
dictionary of real requests, where it has rank 5, whereas in the file
of documents its frequency is low (rank 101). One of the reasons for
the divergence of the frequencies of descriptors in requests and
documents is therefore that the content of the collection unavoidably
lags behind the needs of specialists.

EXPERIMENTAL CHECK OF THE REGULARITIES OF THE
DISTRIBUTION OF DESCRIPTORS
INSIDE THE REQUEST

After the analysis of the hypotheses concerning the applicability of
the Zipf law to the distribution of descriptors in the search patterns
of documents and requests, and concerning the proximity of these
distributions to each other, the same files of requests and documents
were used to check the values of functionals derived from these
hypotheses which reflect the regularities of the distribution of
descriptors inside a request [1].

One such functional is the mathematical expectation Mi(j, z) of the
number (i.e., the rank in the frequency dictionary) of the descriptor
that occupies the j-th place in a request of z descriptors. It is
assumed that the descriptors in a request are arranged in order of
increasing rank in the frequency dictionary constructed from the
documents.

Another such functional is the mathematical expectation MN(j, z) of
the number of occurrences, in the search patterns of documents, of the
j-th descriptor of a request of z descriptors.

Under the indicated assumptions these values are determined from
formulas (14), which are sums over i from j to n - z + j containing
the factors (K ln Ai)^(j-1) and (1 - K ln Ai)^(z-j); the remaining
terms are illegible in the source:

(14)

where n is the number of descriptors in the dictionary;
K = (ln n + C)^(-1) is the constant in the formula of the Zipf law;
C = 0.5772 is the Euler constant, with A = e^C; N = fS is the total
number of occurrences of descriptors in the search patterns of
documents, where f is the average depth of indexing of documents and
S is the number of documents in the file.

For the experimental determination of the values of these functionals
all the requests were divided into groups, each group containing
requests with an identical number of descriptors. The descriptors were
ranked according to the frequency dictionary constructed from the
search patterns of documents. Then in every group of requests, for all
descriptors standing at the j-th place (1 ≤ j ≤ z), the mean values of
the ranks and of the frequencies of occurrence of these descriptors in
the search patterns of documents were calculated and compared with the
calculated values of the functionals obtained from formulas (14). From
Figs. 4 and 5, which show the results of the comparison, it is evident
that the mathematical expectations obtained from formulas (14) are of
the same order as the mean values found experimentally.

Thus formulas (14) can be considered suitable for estimated
calculations.
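
A rough numerical counterpart to the first functional can be obtained
from order statistics. The sketch below is not formulas (14) of [1];
it assumes that the z descriptors of a request are independent draws
from the Zipf distribution (so repeats are possible, unlike in a real
request) and computes the exact expected rank of the j-th smallest
draw. The function names are hypothetical, and n = 502 is the
dictionary size from the text.

```python
import math

def zipf_cdf(n):
    """Cumulative Zipf probabilities F(i) = P(rank <= i), i = 1..n."""
    weights = [1.0 / i for i in range(1, n + 1)]
    total = sum(weights)
    cdf, acc = [], 0.0
    for w in weights:
        acc += w / total
        cdf.append(acc)
    return cdf

def expected_rank(j, z, cdf):
    """Expected rank of the j-th smallest of z independent Zipf draws.

    Uses E[X] = sum over i >= 0 of P(X > i), where P(j-th smallest > i)
    is the probability that fewer than j of the z draws are <= i.
    """
    expectation = 0.0
    for F in [0.0] + cdf[:-1]:  # F(i) for i = 0 .. n-1
        expectation += sum(
            math.comb(z, k) * F ** k * (1.0 - F) ** (z - k) for k in range(j)
        )
    return expectation

cdf = zipf_cdf(502)
for j in range(1, 4):
    print(j, round(expected_rank(j, 3, cdf), 1))
```

The expected rank grows quickly with j, which matches the qualitative
picture of a request made up of a few frequent descriptors followed by
rarer ones.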

It should be noted that from the total series of 237 requests it is
not possible to obtain greater accuracy in determining the
mathematical expectations of the investigated functionals from their
mean values.

Fig. 4. Mathematical expectation of the number of the j-th descriptor
of a request in the dictionary: experimental and calculated values.

Fig. 5. Mathematical expectation of the number of occurrences of the
j-th descriptor of the request in the search patterns of documents:
experimental and calculated values.

Indeed, an analysis of the variability of the functionals showed that
their mean square deviations are close in value to their mathematical
expectations. Therefore it is possible to find the relative error in
the determination of a mathematical expectation from its mean value,
calculated in a group of t_z requests:

$$\delta = \frac{3 \sigma_{\text{mean}}}{M}, \tag{15}$$

where M is the mathematical expectation of the investigated
functional; σ is the mean square deviation of this functional from the
mathematical expectation; σ_mean = σ/√t_z is the mean square deviation
of the mean value of the functional from the mathematical expectation.

Since M ≈ σ,

$$\delta \approx \frac{3}{\sqrt{t_z}}. \tag{16}$$


The relative error calculated from this formula for every group of
requests, depending on the number of descriptors in the request, is
shown in Table 3. For the small groups of one- and seven-descriptor
requests the mathematical expectations of the functionals can be
obtained from their mean values only with a relative error that
reaches 100%. In the remaining cases this error can reach 30% and
more, which fully explains the significant divergences of the
experimental values of the functionals from the calculated ones.

Table 3. The error of the experimental evaluation of the functionals
connected with the distribution of descriptors inside the request.

Number of descriptors in the request, z    1     2    3    4    5    6    7
Number of requests in the group, t_z       4     29   85   72   37   18   2
Relative error                             100%  55%  32%  35%  49%  78%  100%

Nevertheless the experiment again confirms the validity of the
accepted hypotheses, and also the suitability of the formulas derived
in [1] for estimated calculations of the functionals.
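
The entries of Table 3 can be compared with a direct evaluation of the
error estimate. The sketch below assumes that formula (16) reads
δ ≈ 3 / sqrt(t_z) and that values above 100% are reported as 100%;
both are readings of the damaged source rather than statements taken
from it. The function name is hypothetical.

```python
import math

def relative_error(t):
    """delta ~ 3 / sqrt(t), capped at 1.0 (assumption: 100% is the maximum shown)."""
    return min(1.0, 3.0 / math.sqrt(t))

# Number of descriptors in the request -> number of requests in the group.
group_sizes = {1: 4, 2: 29, 3: 85, 4: 72, 5: 37, 6: 18, 7: 2}

errors = {z: relative_error(t) for z, t in group_sizes.items()}
for z, err in errors.items():
    print(f"z = {z}: t_z = {group_sizes[z]:>2}, error = {err:.0%}")
```

For z = 2 to 5 the computed values match the table after truncation to
whole percent. For z = 6 the estimate gives about 71%, whereas the
table shows 78%; the text does not allow the reason for that
difference to be determined.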

CONCLUSIONS

On the basis of the experiments conducted it is possible to draw the
following conclusions.

The frequency of appearance of descriptors in the search patterns of
documents and requests does indeed obey the laws of Zipf and
Mandelbrot, which are known for the distribution of words in natural
languages.

The frequency of occurrence of descriptors in the search patterns of
documents is determined by the joint interaction of a number of random
factors, such as the mean frequency of use of the keywords of the
descriptor, the number of keywords in the equivalence class of the
descriptor, and the number of references to the given descriptor from
its subordinates. Each of these factors separately influences the
distribution of descriptors to a considerable degree, and the
distribution nevertheless obeys the Zipf law. This confirms the
assumption that the Zipf law can be deduced from the Bernoulli scheme.


FTD-MT-24-1456-71


The frequency dictionaries of descriptors for documents and requests
are statistically close to each other, which confirms the hypothesis
that the laws of distribution of descriptors in the search patterns of
documents and requests agree.

The formulas for the mathematical expectations of the rank and of the
frequency of occurrence of the descriptors of a request in the search
patterns of documents, which were derived in [1] on the basis of this
hypothesis and on the assumption that the Zipf law holds, are fully
suitable for estimated calculations of these values.

The author thanks P. N. Sapozhnikov for the attentive review of the
manuscript and valuable remarks.